Goal :
Our objective for this project is, first and foremost, to clean the data: address missing values, inconsistencies, and incorrect data types so the dataset is ready for analysis. We then examine how different variables (e.g., age, sex, class, and fare) relate to and influence one another, and look for patterns and trends that give insight into the factors affecting survival rates.
About Dataset :
The dataset was provided by Prodigy Info tech via Kaggle. The Titanic dataset contains information about the passengers aboard the RMS Titanic, which sank on its maiden voyage in 1912. It was intended for a machine learning competition, so it includes 3 tables: training data, testing data, and an example of what the predicted values should look like. We'll only be working with the training data since it's the only one with complete features (the testing table has no Survived column).
Link to the dataset : Titanic
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
df=pd.read_csv('C:/Users/obalabi adepoju/Downloads/titanic/train.csv')
We'll first of all look at a general overview of our data followed by short descriptions of a few of its columns.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
print(f"The dataset consists of {df.shape[0]} rows and {df.shape[1]} columns")
The dataset consists of 891 rows and 12 columns
# Let's look at a few of the records we have here to understand what we're dealing with.
df.head(10)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
We'll start with our passenger id column to ensure there are no duplicates.
We'll rename our column "ID" for ease.
df.rename(columns={"PassengerId":'ID'},inplace =True)
#Checking for duplicates
print(f"Duplicates : {df['ID'].duplicated().any()}")
Duplicates : False
# Next we'll examine our survived column to make sure it only consists of the values (0,1)
df['Survived'].unique()
array([0, 1], dtype=int64)
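The two checks above (unique IDs, a strictly binary Survived column) can be bundled into one small helper so they are easy to re-run after later cleaning steps. This is only a sketch: `validate_keys` is a hypothetical name, and the rows below are made up for illustration.

```python
import pandas as pd

# Toy frame using this notebook's column names (the rows are made up).
sample = pd.DataFrame({"ID": [1, 2, 3, 4], "Survived": [0, 1, 1, 0]})

def validate_keys(frame):
    """Return basic integrity checks as a dict (hypothetical helper)."""
    return {
        "ids_unique": bool(frame["ID"].is_unique),
        "survived_is_binary": bool(frame["Survived"].isin([0, 1]).all()),
    }

checks = validate_keys(sample)
print(checks)
```

On the real data the same call would be `validate_keys(df)`, once the ID column has been renamed.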
Next we'll be looking at our pclass column.
We'll rename it "class" for clarity and view the distribution of passengers across ticket classes.
df.rename(columns={'Pclass':'class'},inplace=True)
# Use `fig` for the figure so the matplotlib alias `plt` is not overwritten
fig = px.histogram(df, x='class', title='Social Status Distribution', color='class')
fig.show()
df['class'].value_counts()
class
3    491
1    216
2    184
Name: count, dtype: int64
First Class: Tickets for first-class accommodations on the Titanic were very expensive. Prices ranged from about 150 to 4,350 dollars (equivalent to approximately 4,000 to 115,000 today, adjusting for inflation). This high cost was due to the luxurious amenities and services offered, including spacious cabins, fine dining, and exclusive access to various facilities.
Second Class: Second-class tickets were less expensive but still a significant expense, ranging from about 60 to 150 dollars (about 1,600 to 4,000 dollars today). Second-class passengers enjoyed a high level of comfort and service, though not as lavish as in first class.
Third Class: Third-class tickets were much more affordable, typically costing between 15 and 40 dollars (approximately 400 to 1,100 today). These accommodations were more basic but still provided reasonable comfort for the time.
Recall that the original data was split into two sets (training and testing), so the figures here may not fully reflect the complete passenger list. Even so, the prices above likely explain why third-class tickets make up more than half of our data: they were the cheapest option. They do not explain why first class outnumbers second class; that may be an effect of the incomplete data, or it may simply be how the passengers were distributed.
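To put numbers on "more than half", the class counts printed earlier can be turned into percentage shares. The sketch below rebuilds the counts by hand from that output; the variable names are illustrative only.

```python
import pandas as pd

# Class counts copied from the value_counts() output shown earlier.
class_counts = pd.Series({3: 491, 1: 216, 2: 184}, name="count")

# Convert the counts to percentage shares of all passengers.
class_share = (class_counts / class_counts.sum() * 100).round(1)
print(class_share)
```

On the full DataFrame, `df['class'].value_counts(normalize=True)` should give the same shares as fractions.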
# Let's check out our Sex column
fig = px.pie(df, names='Sex', title='Sex Distribution', hole=0.5)
fig.show()
Next we'll be looking at our age column.
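As we saw earlier, Age is stored as a floating-point column. Part of the reason is that it contains missing values, which pandas cannot hold in a plain int64 column, but we'd also expect recorded ages to be whole numbers, so we'll first look at the non-null ages that are fractional.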
data = df[(df['Age'] % 1 != 0) & (~df['Age'].isnull()) ].sort_values('Age',ascending=False)
data
| | ID | Survived | class | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 116 | 117 | 0 | 3 | Connors, Mr. Patrick | male | 70.50 | 0 | 0 | 370369 | 7.7500 | NaN | Q |
| 152 | 153 | 0 | 3 | Meo, Mr. Alfonzo | male | 55.50 | 0 | 0 | A.5. 11206 | 8.0500 | NaN | S |
| 331 | 332 | 0 | 1 | Partner, Mr. Austen | male | 45.50 | 0 | 0 | 113043 | 28.5000 | C124 | S |
| 203 | 204 | 0 | 3 | Youseff, Mr. Gerious | male | 45.50 | 0 | 0 | 2628 | 7.2250 | NaN | C |
| 153 | 154 | 0 | 3 | van Billiard, Mr. Austin Blyler | male | 40.50 | 0 | 2 | A/5. 851 | 14.5000 | NaN | S |
| 525 | 526 | 0 | 3 | Farrell, Mr. James | male | 40.50 | 0 | 0 | 367232 | 7.7500 | NaN | Q |
| 148 | 149 | 0 | 2 | Navratil, Mr. Michel ("Louis M Hoffman") | male | 36.50 | 0 | 2 | 230080 | 26.0000 | F2 | S |
| 843 | 844 | 0 | 3 | Lemberopolous, Mr. Peter L | male | 34.50 | 0 | 0 | 2683 | 6.4375 | NaN | C |
| 122 | 123 | 0 | 2 | Nasser, Mr. Nicholas | male | 32.50 | 1 | 0 | 237736 | 30.0708 | NaN | C |
| 123 | 124 | 1 | 2 | Webber, Miss. Susan | female | 32.50 | 0 | 0 | 27267 | 13.0000 | E101 | S |
| 814 | 815 | 0 | 3 | Tomlin, Mr. Ernest Portage | male | 30.50 | 0 | 0 | 364499 | 8.0500 | NaN | S |
| 767 | 768 | 0 | 3 | Mangan, Miss. Mary | female | 30.50 | 0 | 0 | 364850 | 7.7500 | NaN | Q |
| 735 | 736 | 0 | 3 | Williams, Mr. Leslie | male | 28.50 | 0 | 0 | 54636 | 16.1000 | NaN | S |
| 57 | 58 | 0 | 3 | Novel, Mr. Mansouer | male | 28.50 | 0 | 0 | 2697 | 7.2292 | NaN | C |
| 676 | 677 | 0 | 3 | Sawyer, Mr. Frederick Charles | male | 24.50 | 0 | 0 | 342826 | 8.0500 | NaN | S |
| 296 | 297 | 0 | 3 | Hanna, Mr. Mansour | male | 23.50 | 0 | 0 | 2693 | 7.2292 | NaN | C |
| 227 | 228 | 0 | 3 | Lovell, Mr. John Hall ("Henry") | male | 20.50 | 0 | 0 | A/5 21173 | 7.2500 | NaN | S |
| 111 | 112 | 0 | 3 | Zabour, Miss. Hileni | female | 14.50 | 1 | 0 | 2665 | 14.4542 | NaN | C |
| 305 | 306 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.92 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S |
| 78 | 79 | 1 | 2 | Caldwell, Master. Alden Gates | male | 0.83 | 0 | 2 | 248738 | 29.0000 | NaN | S |
| 831 | 832 | 1 | 2 | Richards, Master. George Sibley | male | 0.83 | 1 | 1 | 29106 | 18.7500 | NaN | S |
| 469 | 470 | 1 | 3 | Baclini, Miss. Helene Barbara | female | 0.75 | 2 | 1 | 2666 | 19.2583 | NaN | C |
| 644 | 645 | 1 | 3 | Baclini, Miss. Eugenie | female | 0.75 | 2 | 1 | 2666 | 19.2583 | NaN | C |
| 755 | 756 | 1 | 2 | Hamalainen, Master. Viljo | male | 0.67 | 1 | 1 | 250649 | 14.5000 | NaN | S |
| 803 | 804 | 1 | 3 | Thomas, Master. Assad Alexander | male | 0.42 | 0 | 1 | 2625 | 8.5167 | NaN | C |
After cross-referencing these names with multiple sources online, we were able to confirm the ages of all of them, and we'll now clean our data accordingly. For every age ending in ".5", the number before the decimal point is correct (Kaggle's data dictionary notes that estimated ages are recorded in the form xx.5), so we'll deal with those first by dropping the fractional part.
data['Age']=data['Age'].astype(int)
After researching the passengers whose recorded age starts with 0, we found that each of them was 3 years old or younger.
# Now we'll manually enter the ages of the passengers whose age was truncated to 0 (positions 18 to 24 in `data`)
data.iloc[18:25, data.columns.get_loc('Age')] = [1, 1, 1, 3, 3, 1, 1]
data.sort_values('ID',inplace=True)
data
| | ID | Survived | class | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 57 | 58 | 0 | 3 | Novel, Mr. Mansouer | male | 28 | 0 | 0 | 2697 | 7.2292 | NaN | C |
| 78 | 79 | 1 | 2 | Caldwell, Master. Alden Gates | male | 1 | 0 | 2 | 248738 | 29.0000 | NaN | S |
| 111 | 112 | 0 | 3 | Zabour, Miss. Hileni | female | 14 | 1 | 0 | 2665 | 14.4542 | NaN | C |
| 116 | 117 | 0 | 3 | Connors, Mr. Patrick | male | 70 | 0 | 0 | 370369 | 7.7500 | NaN | Q |
| 122 | 123 | 0 | 2 | Nasser, Mr. Nicholas | male | 32 | 1 | 0 | 237736 | 30.0708 | NaN | C |
| 123 | 124 | 1 | 2 | Webber, Miss. Susan | female | 32 | 0 | 0 | 27267 | 13.0000 | E101 | S |
| 148 | 149 | 0 | 2 | Navratil, Mr. Michel ("Louis M Hoffman") | male | 36 | 0 | 2 | 230080 | 26.0000 | F2 | S |
| 152 | 153 | 0 | 3 | Meo, Mr. Alfonzo | male | 55 | 0 | 0 | A.5. 11206 | 8.0500 | NaN | S |
| 153 | 154 | 0 | 3 | van Billiard, Mr. Austin Blyler | male | 40 | 0 | 2 | A/5. 851 | 14.5000 | NaN | S |
| 203 | 204 | 0 | 3 | Youseff, Mr. Gerious | male | 45 | 0 | 0 | 2628 | 7.2250 | NaN | C |
| 227 | 228 | 0 | 3 | Lovell, Mr. John Hall ("Henry") | male | 20 | 0 | 0 | A/5 21173 | 7.2500 | NaN | S |
| 296 | 297 | 0 | 3 | Hanna, Mr. Mansour | male | 23 | 0 | 0 | 2693 | 7.2292 | NaN | C |
| 305 | 306 | 1 | 1 | Allison, Master. Hudson Trevor | male | 1 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S |
| 331 | 332 | 0 | 1 | Partner, Mr. Austen | male | 45 | 0 | 0 | 113043 | 28.5000 | C124 | S |
| 469 | 470 | 1 | 3 | Baclini, Miss. Helene Barbara | female | 3 | 2 | 1 | 2666 | 19.2583 | NaN | C |
| 525 | 526 | 0 | 3 | Farrell, Mr. James | male | 40 | 0 | 0 | 367232 | 7.7500 | NaN | Q |
| 644 | 645 | 1 | 3 | Baclini, Miss. Eugenie | female | 3 | 2 | 1 | 2666 | 19.2583 | NaN | C |
| 676 | 677 | 0 | 3 | Sawyer, Mr. Frederick Charles | male | 24 | 0 | 0 | 342826 | 8.0500 | NaN | S |
| 735 | 736 | 0 | 3 | Williams, Mr. Leslie | male | 28 | 0 | 0 | 54636 | 16.1000 | NaN | S |
| 755 | 756 | 1 | 2 | Hamalainen, Master. Viljo | male | 1 | 1 | 1 | 250649 | 14.5000 | NaN | S |
| 767 | 768 | 0 | 3 | Mangan, Miss. Mary | female | 30 | 0 | 0 | 364850 | 7.7500 | NaN | Q |
| 803 | 804 | 1 | 3 | Thomas, Master. Assad Alexander | male | 1 | 0 | 1 | 2625 | 8.5167 | NaN | C |
| 814 | 815 | 0 | 3 | Tomlin, Mr. Ernest Portage | male | 30 | 0 | 0 | 364499 | 8.0500 | NaN | S |
| 831 | 832 | 1 | 2 | Richards, Master. George Sibley | male | 1 | 1 | 1 | 29106 | 18.7500 | NaN | S |
| 843 | 844 | 0 | 3 | Lemberopolous, Mr. Peter L | male | 34 | 0 | 0 | 2683 | 6.4375 | NaN | C |
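The positional assignment used for the infants depends on the sort order of `data`. A less fragile variant, sketched here on a four-row stand-in, keys the corrections by passenger ID. The IDs and corrected ages are taken from the tables above; the names `ages` and `infant_ages` are illustrative.

```python
import pandas as pd

# Miniature stand-in for the fractional-age rows (IDs and ages from the tables above).
ages = pd.DataFrame({
    "ID": [306, 79, 470, 124],
    "Age": [0.92, 0.83, 0.75, 32.5],
})

# Corrected infant ages keyed by passenger ID (values from the cleaned table above).
infant_ages = {306: 1, 79: 1, 832: 1, 470: 3, 645: 3, 756: 1, 804: 1}

# Look up each ID; IDs without an entry fall back to the age with its fraction dropped.
corrected = ages["ID"].map(infant_ages)
ages["Age"] = corrected.fillna(ages["Age"] // 1).astype(int)
print(ages)
```

Because the lookup is by ID, the result does not change if the rows are re-sorted before the correction is applied.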
#Now we fix our original dataframe
df.update(data)
# We check to see if there are any other abnormal ages after correction
df[(df['Age'] % 1 != 0) & (~df['Age'].isnull()) ]
| | ID | Survived | class | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
We have successfully cleaned the fractional ages. Let's now look at the null values.
df['Age'].info()
<class 'pandas.core.series.Series'>
RangeIndex: 891 entries, 0 to 890
Series name: Age
Non-Null Count  Dtype
--------------  -----
714 non-null    float64
dtypes: float64(1)
memory usage: 7.1 KB
As we can see, there are 177 null values (891 - 714). That's far too many to fill in manually given the lack of information, and imputing them with a single value would skew the age distribution and lead to inaccurate insights in our analyses.
So we'll leave these null values as they are and simply work with the ages that are present.
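Leaving the nulls in place works because pandas reductions skip missing values by default. A small sketch on made-up ages shows both the missing count and a mean computed only from the values that are present:

```python
import pandas as pd

# Toy ages with gaps (values are made up for illustration).
age = pd.Series([22.0, float("nan"), 38.0, 26.0, float("nan")], name="Age")

n_missing = int(age.isna().sum())  # number of missing entries
mean_age = age.mean()              # NaN entries are skipped by default
print(n_missing, round(mean_age, 2))
```

Applied to the real column, `df['Age'].isna().sum()` should report the 177 missing ages mentioned above.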
Let's now check out our distribution.
fig = px.violin(df, x='Age', title='Age Distribution', color_discrete_sequence=['mediumseagreen'])
fig.show()
Now, we want to check how many family relations every passenger had on the boat. We'll do this by creating a new column.
df['family'] = df['SibSp'] + df['Parch']
df.head()
| | ID | Survived | class | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | family |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 1 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 0 |
# Let's now check out its distribution.
fig = px.violin(df, x='family', title='Family Distribution', color_discrete_sequence=['royalblue'])
fig.show()
# Next, the distribution of fares
fig = px.histogram(df, x='Fare', title='Fare Distribution', color_discrete_sequence=['royalblue'])
fig.show()
Our fares are concentrated between 5 and 100 dollars, with the highest density between 5 and 15 dollars. We do have a few outlier groups at (100 - 165), (200 - 265) and (505 - 515). Let's check out these values in our data.
Note that a fare is affected by the number of family members travelling with the passenger, so the fare recorded for a passenger may or may not also include the fares of family members.
df[df['Fare'] > 100].sort_values('Ticket')
| | ID | Survived | class | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | family |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 377 | 378 | 0 | 1 | Widener, Mr. Harry Elkins | male | 27.0 | 0 | 2 | 113503 | 211.5000 | C82 | C | 2 |
| 390 | 391 | 1 | 1 | Carter, Mr. William Ernest | male | 36.0 | 1 | 2 | 113760 | 120.0000 | B96 B98 | S | 3 |
| 802 | 803 | 1 | 1 | Carter, Master. William Thornton II | male | 11.0 | 1 | 2 | 113760 | 120.0000 | B96 B98 | S | 3 |
| 763 | 764 | 1 | 1 | Carter, Mrs. William Ernest (Lucile Polk) | female | 36.0 | 1 | 2 | 113760 | 120.0000 | B96 B98 | S | 3 |
| 435 | 436 | 1 | 1 | Carter, Miss. Lucile Polk | female | 14.0 | 1 | 2 | 113760 | 120.0000 | B96 B98 | S | 3 |
| 498 | 499 | 0 | 1 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 3 |
| 297 | 298 | 0 | 1 | Allison, Miss. Helen Loraine | female | 2.0 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 3 |
| 305 | 306 | 1 | 1 | Allison, Master. Hudson Trevor | male | 1.0 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 3 |
| 708 | 709 | 1 | 1 | Cleaver, Miss. Alice | female | 22.0 | 0 | 0 | 113781 | 151.5500 | NaN | S | 0 |
| 319 | 320 | 1 | 1 | Spedden, Mrs. Frederic Oakley (Margaretta Corn... | female | 40.0 | 1 | 1 | 16966 | 134.5000 | E34 | C | 2 |
| 337 | 338 | 1 | 1 | Burns, Miss. Elizabeth Margaret | female | 41.0 | 0 | 0 | 16966 | 134.5000 | E40 | C | 0 |
| 698 | 699 | 0 | 1 | Thayer, Mr. John Borland | male | 49.0 | 1 | 1 | 17421 | 110.8833 | C68 | C | 2 |
| 581 | 582 | 1 | 1 | Thayer, Mrs. John Borland (Marian Longstreth M... | female | 39.0 | 1 | 1 | 17421 | 110.8833 | C68 | C | 2 |
| 306 | 307 | 1 | 1 | Fleming, Miss. Margaret | female | NaN | 0 | 0 | 17421 | 110.8833 | NaN | C | 0 |
| 550 | 551 | 1 | 1 | Thayer, Mr. John Borland Jr | male | 17.0 | 0 | 2 | 17421 | 110.8833 | C70 | C | 2 |
| 341 | 342 | 1 | 1 | Fortune, Miss. Alice Elizabeth | female | 24.0 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S | 5 |
| 438 | 439 | 0 | 1 | Fortune, Mr. Mark | male | 64.0 | 1 | 4 | 19950 | 263.0000 | C23 C25 C27 | S | 5 |
| 88 | 89 | 1 | 1 | Fortune, Miss. Mabel Helen | female | 23.0 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S | 5 |
| 27 | 28 | 0 | 1 | Fortune, Mr. Charles Alexander | male | 19.0 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S | 5 |
| 730 | 731 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 24160 | 211.3375 | B5 | S | 0 |
| 779 | 780 | 1 | 1 | Robert, Mrs. Edward Scott (Elisabeth Walton Mc... | female | 43.0 | 0 | 1 | 24160 | 211.3375 | B3 | S | 1 |
| 689 | 690 | 1 | 1 | Madill, Miss. Georgette Alexandra | female | 15.0 | 0 | 1 | 24160 | 211.3375 | B5 | S | 1 |
| 659 | 660 | 0 | 1 | Newell, Mr. Arthur Webster | male | 58.0 | 0 | 2 | 35273 | 113.2750 | D48 | C | 2 |
| 393 | 394 | 1 | 1 | Newell, Miss. Marjorie | female | 23.0 | 1 | 0 | 35273 | 113.2750 | D36 | C | 1 |
| 215 | 216 | 1 | 1 | Newell, Miss. Madeleine | female | 31.0 | 1 | 0 | 35273 | 113.2750 | D36 | C | 1 |
| 856 | 857 | 1 | 1 | Wick, Mrs. George Dennick (Mary Hitchcock) | female | 45.0 | 1 | 1 | 36928 | 164.8667 | NaN | S | 2 |
| 318 | 319 | 1 | 1 | Wick, Miss. Mary Natalie | female | 31.0 | 0 | 2 | 36928 | 164.8667 | C7 | S | 2 |
| 527 | 528 | 0 | 1 | Farthing, Mr. John | male | NaN | 0 | 0 | PC 17483 | 221.7792 | C95 | S | 0 |
| 118 | 119 | 0 | 1 | Baxter, Mr. Quigg Edmond | male | 24.0 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C | 1 |
| 299 | 300 | 1 | 1 | Baxter, Mrs. James (Helene DeLaudeniere Chaput) | female | 50.0 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C | 1 |
| 31 | 32 | 1 | 1 | Spencer, Mrs. William Augustus (Marie Eugenie) | female | NaN | 1 | 0 | PC 17569 | 146.5208 | B78 | C | 1 |
| 195 | 196 | 1 | 1 | Lurette, Miss. Elise | female | 58.0 | 0 | 0 | PC 17569 | 146.5208 | B80 | C | 0 |
| 332 | 333 | 0 | 1 | Graham, Mr. George Edward | male | 38.0 | 0 | 1 | PC 17582 | 153.4625 | C91 | S | 1 |
| 268 | 269 | 1 | 1 | Graham, Mrs. William Thompson (Edith Junkins) | female | 58.0 | 0 | 1 | PC 17582 | 153.4625 | C125 | S | 1 |
| 609 | 610 | 1 | 1 | Shutes, Miss. Elizabeth W | female | 40.0 | 0 | 0 | PC 17582 | 153.4625 | C125 | S | 0 |
| 311 | 312 | 1 | 1 | Ryerson, Miss. Emily Borie | female | 18.0 | 2 | 2 | PC 17608 | 262.3750 | B57 B59 B63 B66 | C | 4 |
| 742 | 743 | 1 | 1 | Ryerson, Miss. Susan Parker "Suzette" | female | 21.0 | 2 | 2 | PC 17608 | 262.3750 | B57 B59 B63 B66 | C | 4 |
| 334 | 335 | 1 | 1 | Frauenthal, Mrs. Henry William (Clara Heinshei... | female | NaN | 1 | 0 | PC 17611 | 133.6500 | NaN | S | 1 |
| 660 | 661 | 1 | 1 | Frauenthal, Dr. Henry William | male | 50.0 | 2 | 0 | PC 17611 | 133.6500 | NaN | S | 2 |
| 679 | 680 | 1 | 1 | Cardeza, Mr. Thomas Drake Martinez | male | 36.0 | 0 | 1 | PC 17755 | 512.3292 | B51 B53 B55 | C | 1 |
| 737 | 738 | 1 | 1 | Lesurer, Mr. Gustave J | male | 35.0 | 0 | 0 | PC 17755 | 512.3292 | B101 | C | 0 |
| 258 | 259 | 1 | 1 | Ward, Miss. Anna | female | 35.0 | 0 | 0 | PC 17755 | 512.3292 | NaN | C | 0 |
| 716 | 717 | 1 | 1 | Endres, Miss. Caroline Louise | female | 38.0 | 0 | 0 | PC 17757 | 227.5250 | C45 | C | 0 |
| 380 | 381 | 1 | 1 | Bidois, Miss. Rosalie | female | 42.0 | 0 | 0 | PC 17757 | 227.5250 | NaN | C | 0 |
| 557 | 558 | 0 | 1 | Robbins, Mr. Victor | male | NaN | 0 | 0 | PC 17757 | 227.5250 | NaN | C | 0 |
| 700 | 701 | 1 | 1 | Astor, Mrs. John Jacob (Madeleine Talmadge Force) | female | 18.0 | 1 | 0 | PC 17757 | 227.5250 | C62 C64 | C | 1 |
| 307 | 308 | 1 | 1 | Penasco y Castellana, Mrs. Victor de Satode (M... | female | 17.0 | 1 | 0 | PC 17758 | 108.9000 | C65 | C | 1 |
| 505 | 506 | 0 | 1 | Penasco y Castellana, Mr. Victor de Satode | male | 18.0 | 1 | 0 | PC 17758 | 108.9000 | C65 | C | 1 |
| 373 | 374 | 0 | 1 | Ringhini, Mr. Sante | male | 22.0 | 0 | 0 | PC 17760 | 135.6333 | NaN | C | 0 |
| 325 | 326 | 1 | 1 | Young, Miss. Marie Grice | female | 36.0 | 0 | 0 | PC 17760 | 135.6333 | C32 | C | 0 |
| 269 | 270 | 1 | 1 | Bissette, Miss. Amelia | female | 35.0 | 0 | 0 | PC 17760 | 135.6333 | C99 | S | 0 |
| 544 | 545 | 0 | 1 | Douglas, Mr. Walter Donald | male | 50.0 | 1 | 0 | PC 17761 | 106.4250 | C86 | C | 1 |
| 537 | 538 | 1 | 1 | LeRoy, Miss. Bertha | female | 30.0 | 0 | 0 | PC 17761 | 106.4250 | NaN | C | 0 |
It's worth noting that all of the significantly higher fares come from first-class tickets, that most of them are shared by several people, and that these passengers embarked at either Southampton or Cherbourg. This would explain the outliers: first-class tickets are generally more expensive, and a larger party also raises the fare. Note too that this isn't our entire dataset, so it's very likely that more people are associated with these tickets than are shown here.
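One way to explore shared tickets is to count how many listed passengers carry each ticket number and divide the fare by that count. The sketch below uses five rows copied from the table above. It rests on an assumption the data does not confirm: that Fare is the total for the whole ticket. It also only counts passengers present in the rows given.

```python
import pandas as pd

# A few rows copied from the table above (tickets PC 17755 and 19950).
fares = pd.DataFrame({
    "ID": [680, 738, 259, 342, 439],
    "Ticket": ["PC 17755", "PC 17755", "PC 17755", "19950", "19950"],
    "Fare": [512.3292, 512.3292, 512.3292, 263.0, 263.0],
})

# Count how many listed passengers share each ticket.
fares["ticket_size"] = fares.groupby("Ticket")["ID"].transform("count")

# Assumption: Fare is the ticket total, so split it evenly across the group.
fares["fare_per_person"] = (fares["Fare"] / fares["ticket_size"]).round(2)
print(fares)
```

Since some ticket holders are in the testing table, group sizes computed from the training data alone are lower bounds.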
#Our last column to inspect is the embarked column, we'll first check to see if there are null values
df['Embarked'].info()
<class 'pandas.core.series.Series'>
RangeIndex: 891 entries, 0 to 890
Series name: Embarked
Non-Null Count  Dtype
--------------  -----
889 non-null    object
dtypes: object(1)
memory usage: 7.1+ KB
#Just 2, let's check it out
df[df.Embarked.isnull()]
| | ID | Survived | class | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | family |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 61 | 62 | 1 | 1 | Icard, Miss. Amelie | female | 38.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN | 0 |
| 829 | 830 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN | 0 |
We looked them up and found that they both embarked at Southampton.
# Assign the result back; calling fillna(inplace=True) on a selected column may not update df in recent pandas versions
df['Embarked'] = df['Embarked'].fillna('S')
# Let's check out its distribution
fig = px.pie(df, names='Embarked', title='Embarkation Distribution', color='Embarked',
             color_discrete_map={'C': 'deepskyblue', 'S': 'dodgerblue', 'Q': 'lightblue'}, hole=0.5)
fig.show()
We'll go through this next step by answering some questions about our data relative to the survival column and they include:
Our initial hypothesis for this phase is:
# We want to check the distribution of gender relative to survival
fig = px.histogram(df, x='Sex', title='Gender Distribution by Survival', color='Survived', barmode='group')
fig.show()
Calculating from our data, only about 18.9 % of male passengers survived, compared with 74.2 % of female passengers. Not to make assumptions so early on, but the 'Women and Children First' protocol is a likely cause of this disparity.
Given this difference, we can conclude that women had a higher probability of surviving than men, thanks to the sacrifices of those brave men.
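The percentages quoted above can be reproduced with a row-normalised cross-tabulation. The sketch below rebuilds a two-column frame from the group counts usually reported for train.csv (577 men of whom 109 survived, 314 women of whom 233 survived). Those counts are an assumption here, since the notebook itself only shows the resulting percentages.

```python
import pandas as pd

# Assumption: group counts for train.csv (577 men, 109 survived; 314 women, 233 survived).
passengers = pd.DataFrame({
    "Sex": ["male"] * 577 + ["female"] * 314,
    "Survived": [1] * 109 + [0] * 468 + [1] * 233 + [0] * 81,
})

# normalize="index" makes each row sum to 1, i.e. shares within each sex.
rates = (pd.crosstab(passengers["Sex"], passengers["Survived"], normalize="index") * 100).round(1)
print(rates)
```

With the real data, the same call on `df['Sex']` and `df['Survived']` avoids any hand-entered counts.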
This information will be very useful during the data modelling phase.
Note that each phase of our analysis will build on the previous one. With that, let's check out our next column.
Our next hypothesis for this phase is that:
# As before, we'll check the distribution of class relative to survival
fig = px.histogram(df, x='class', title='Class Distribution by Survival', color='Survived',
                   barmode='group', color_discrete_map={1: 'royalblue', 0: '#EF553B'})
fig.show()
We can see from this distribution that the number of survivors relative to the number of non-survivors rises as we move from 3rd class to 1st. This observation supports our hypothesis, but let's dig further into these factors.
fig = px.histogram(df, x='Sex', title='Gender Distribution by Class', color='class',
                   barmode='group', color_discrete_map={3: 'royalblue', 1: '#EF553B', 2: 'mediumseagreen'})
fig.show()
The distribution across ticket classes is fairly similar for both genders, but only for first and second class. It would seem that third-class males make up the largest share of our population.
Let's check!
third_class_males = int(((df['class'] == 3) & (df['Sex'] == 'male')).sum())
print(f"Third-class males make up about {round(third_class_males / len(df) * 100)} % of our entire population")
Third-class males make up about 39 % of our entire population
With respect to our grouping, that's quite the number.
Next we want to visualize the connection between our first and second hypotheses to determine how these two factors would interact with respect to our survival.
# Create subplots
fig = make_subplots(rows=1, cols=2, subplot_titles=("Male Distribution", "Female Distribution"))
# Filter data for males and females
df_male = df[df['Sex'] == 'male']
df_female = df[df['Sex'] == 'female']
# Create histograms for each gender
hist_male = px.histogram(df_male, x='class', color='Survived', barmode='group',
color_discrete_map={1: 'royalblue', 0: '#EF553B'})
hist_female = px.histogram(df_female, x='class', color='Survived', barmode='group',
color_discrete_map={1: 'royalblue', 0: '#EF553B'})
# Add histograms to subplots
for trace in hist_male.data:
    fig.add_trace(trace, row=1, col=1)
for trace in hist_female.data:
    fig.add_trace(trace, row=1, col=2)
# Update layout
fig.update_layout(title_text='Gender Distribution by Survival', showlegend=True)
# Show plot
fig.show()
The female distribution shows most clearly how our two theories interact: there's a higher number of survivors overall among the female passengers, and the share of survivors increases sharply as we move up in class. So our two theories combine into a further conclusion:
Females have a higher chance of survival than males, and that chance increases as the class level increases.
But don't take my word for it; we'll do further calculations next to affirm or disprove this.
# We want to check the chances of survival for male and female population for each class
def prob(df1, df2):
    prob_male = df1.groupby('class').agg(total=('Survived', 'count'), survived=('Survived', 'sum')).reset_index()
    prob_female = df2.groupby('class').agg(total=('Survived', 'count'), survived=('Survived', 'sum')).reset_index()
    prob_male['probability(%)'] = round((prob_male['survived'] / prob_male['total'] * 100), 1)
    prob_female['probability(%)'] = round((prob_female['survived'] / prob_female['total'] * 100), 1)
    return prob_male, prob_female
prob_male,prob_female = prob(df_male,df_female)
# Create subplots
fig = make_subplots(
rows=1, cols=2,
subplot_titles=("Male Probability", "Female Probability"),
horizontal_spacing=0.1 # Adjust space between plots
)
bar_male = go.Bar(
x=prob_male['class'],
y=prob_male['probability(%)'],
marker_color='mediumseagreen',
)
bar_female = go.Bar(
x=prob_female['class'],
y=prob_female['probability(%)'],
marker_color='mediumseagreen',
)
# Add bar plots to subplots
fig.add_trace(bar_male, row=1, col=1)
fig.add_trace(bar_female, row=1, col=2)
# Update layout
fig.update_layout(
title_text='Probability Distribution',
xaxis_title='Class',
yaxis_title='Probability (%)',
xaxis2_title='Class',
yaxis2_title='Probability (%)',
yaxis=dict(range=[0, 100]),
showlegend=False
)
# Show plot
fig.show()
Well, this certainly affirms our hypotheses. Males generally have lower chances of survival: first-class males have the highest rate among men, at only 37 %, whereas first-class females have a 97 % chance of surviving and third-class females an even 50 %. All in all, this shows how our two hypotheses interact and complement each other.
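The per-group probabilities can also be computed in a single pass by grouping on both columns at once, which avoids splitting the frame by sex first. The sketch below is one possible generalisation, not the notebook's own method: `survival_rate` is a hypothetical helper, and the (total, survived) counts are the ones usually reported for train.csv, entered by hand as an assumption.

```python
import pandas as pd

# Assumption: (total, survived) per sex and class, as usually reported for train.csv.
counts = {
    ("male", 1): (122, 45), ("male", 2): (108, 17), ("male", 3): (347, 47),
    ("female", 1): (94, 91), ("female", 2): (76, 70), ("female", 3): (144, 72),
}
rows = []
for (sex, pclass), (total, survived) in counts.items():
    rows += [{"Sex": sex, "class": pclass, "Survived": 1}] * survived
    rows += [{"Sex": sex, "class": pclass, "Survived": 0}] * (total - survived)
passengers = pd.DataFrame(rows)

def survival_rate(frame, by):
    """Survival rate in percent for each group in `by` (hypothetical helper)."""
    return (frame.groupby(by)["Survived"].mean() * 100).round(1)

# Rows are sexes, columns are ticket classes.
table = survival_rate(passengers, ["Sex", "class"]).unstack("class")
print(table)
```

The same helper accepts any grouping, for example `survival_rate(df, 'class')` on the full frame.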
After a quick overview of the evacuation procedure online, our hypothesis for this next bout is that:
''' We want to separate our data into different categories based on age groups
Child : 12 and below
Teenager : (13 - 19)
Adult : (20 - 50)
Elderly : 51 and above '''
# Missing ages fail every comparison and fall through to the 'Nan' label
df['category'] = df['Age'].apply(lambda x: 'Child' if x <= 12 else 'Teenager' if x <= 19 else 'Adult' if x <= 50 else 'Elderly' if x > 50 else 'Nan')
# What our new column looks like
df.head()
| | ID | Survived | class | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | family | category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 1 | Adult |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 | Adult |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 | Adult |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 1 | Adult |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 0 | Adult |
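The same age bands can be expressed with `pd.cut`, which keeps the boundaries in one list and leaves missing ages as NaN instead of a string label. This is an alternative sketch, not the approach used above; the ages are made up to hit every boundary.

```python
import pandas as pd

# Toy ages covering every boundary, plus one missing value.
age = pd.Series([0.42, 12, 13, 19, 20, 50, 51, float("nan")], name="Age")

# Right-closed bins: (-inf, 12], (12, 19], (19, 50], (50, inf].
bins = [float("-inf"), 12, 19, 50, float("inf")]
labels = ["Child", "Teenager", "Adult", "Elderly"]

category = pd.cut(age, bins=bins, labels=labels)
print(category.tolist())
```

Because missing ages stay missing, a later `dropna(subset=['category'])` would replace the string comparison against 'Nan'.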
# Filtering our data to not include null values
d = df[df.category != 'Nan']
# Viewing our distribution
fig = px.histogram(d, x='category', title='Age Distribution by Survival', color='Survived',
                   barmode='group', color_discrete_map={1: 'royalblue', 0: '#EF553B'})
fig.show()
Among children, there were more survivors than non-survivors, which stands out as the exception. In every other age category, including teenagers, the number of survivors is lower than those who didn't make it. This outcome aligns with our expectations and reinforces our initial hypothesis. The next step is to examine this trend in relation to gender.
# We want to check the chances of survival for the male and female population in each age category
def prob1(df1, df2):
    prob_male = df1.groupby('category').agg(total=('Survived', 'count'), survived=('Survived', 'sum')).reset_index()
    prob_female = df2.groupby('category').agg(total=('Survived', 'count'), survived=('Survived', 'sum')).reset_index()
    prob_male['probability(%)'] = round((prob_male['survived'] / prob_male['total'] * 100), 1)
    prob_female['probability(%)'] = round((prob_female['survived'] / prob_female['total'] * 100), 1)
    return prob_male, prob_female
d_male = d[d['Sex'] == 'male']
d_female = d[d['Sex'] == 'female']
prob1_male,prob1_female = prob1(d_male,d_female)
# Create subplots
fig = make_subplots(
rows=1, cols=2,
subplot_titles=("Male Probability", "Female Probability"),
horizontal_spacing=0.1 # Adjust space between plots
)
# Create bar plots for each gender
bar_male = go.Bar(
x=prob1_male['category'],
y=prob1_male['probability(%)'],
marker_color='royalblue',
)
bar_female = go.Bar(
x=prob1_female['category'],
y=prob1_female['probability(%)'],
marker_color='royalblue',
)
# Add bar plots to subplots
fig.add_trace(bar_male, row=1, col=1)
fig.add_trace(bar_female, row=1, col=2)
# Update layout
fig.update_layout(
title_text='Probability Distribution',
xaxis_title='Category',
yaxis_title='Probability (%)',
xaxis2_title='Category',
yaxis2_title='Probability (%)',
yaxis=dict(range=[0, 100]),
showlegend=False
)
# Show plot
fig.show()
For male children the chance of survival is 57 %, while for female children it is 59 %. A difference exists, but it is too small to suggest any gender preference. This is not the case for the remaining categories, where the disparity between the sexes is still very large, most notably in the elderly group: elderly females have a 94 % chance of survival, compared with 13 % for elderly males.
We've now seen the interaction between gender and age, let's look at the connection between age and class.
# We want to check the chances of survival for each age category within each ticket class
def prob2(df1, df2, df3):
    prob_1 = df1.groupby('category').agg(total=('Survived', 'count'), survived=('Survived', 'sum')).reset_index()
    prob_2 = df2.groupby('category').agg(total=('Survived', 'count'), survived=('Survived', 'sum')).reset_index()
    prob_3 = df3.groupby('category').agg(total=('Survived', 'count'), survived=('Survived', 'sum')).reset_index()
    prob_1['probability(%)'] = round((prob_1['survived'] / prob_1['total'] * 100), 1)
    prob_2['probability(%)'] = round((prob_2['survived'] / prob_2['total'] * 100), 1)
    prob_3['probability(%)'] = round((prob_3['survived'] / prob_3['total'] * 100), 1)
    return prob_1, prob_2, prob_3
d_1 = d[d['class'] == 1]
d_2 = d[d['class'] == 2]
d_3 = d[d['class'] == 3]
prob_1,prob_2,prob_3 = prob2(d_1,d_2,d_3)
# Create subplots
fig = make_subplots(
rows=1, cols=3,
subplot_titles=("First Class Probability", "Second Class Probability",'Third Class Probability'),
horizontal_spacing=0.1
)
bar_f = go.Bar(
x=prob_1['category'],
y=prob_1['probability(%)'],
marker_color='royalblue',width=0.5
)
bar_s = go.Bar(
x=prob_2['category'],
y=prob_2['probability(%)'],
marker_color='#EF553B',width =0.5
)
bar_t = go.Bar(
x=prob_3['category'],
y=prob_3['probability(%)'],
marker_color='mediumseagreen',width =0.5
)
# Add bar plots to subplots
fig.add_trace(bar_f, row=1, col=1)
fig.add_trace(bar_s, row=1, col=2)
fig.add_trace(bar_t, row=1, col=3)
# Update layout
fig.update_layout(
title_text='Probability Distribution',
xaxis_title='Category',
yaxis_title='Probability (%)',
xaxis2_title='Category',
yaxis2_title='Probability (%)',
xaxis3_title='Category',
yaxis3_title='Probability (%)',
yaxis=dict(range=[0, 100]),
yaxis2=dict(range=[0, 100]),  # second panel was missing the shared 0-100 scale
yaxis3=dict(range=[0, 100]),
showlegend=False
)
# Show plot
fig.show()
We observe that for every age group other than children, the chance of survival drops significantly from higher to lower classes. No category in third class reaches a 50% survival rate, which indicates that class was a very significant factor in survival.
Children have the highest chance of survival in every class except first, where teenagers rank highest. Overall, this shows that age was also a very significant factor in survival.
Although the earlier comparison by gender showed little difference in survival among children, this plot tells a different story. There is a significant disparity in survival rates across classes: third-class children have a survival rate of 41%, meaning roughly 4 out of every 10 survived. In contrast, the rate is 100% in second class (every child survived) and 75% in first class (3 out of every 4). Since children are generally prioritized during evacuation, we might expect class to matter little for them, but the data suggests otherwise.
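A percentage is easier to judge when the group size behind it is visible, since a small group can produce an extreme rate from only a few passengers. The sketch below keeps the `total` and `survived` counts next to the rate, using the same aggregation as `prob2`. The rows are made-up toy values, not the Titanic data, and the `child` label is an assumption about how the categories are named.

```python
import pandas as pd

# Hypothetical toy rows, NOT the Titanic data; column names mirror the notebook's `d`.
toy = pd.DataFrame({
    "class":    [1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
    "category": ["child"] * 10,
    "Survived": [1, 0, 1, 1, 1, 1, 0, 0, 1, 0],
})

# Keep only children, then count passengers and survivors per class.
children = toy[toy["category"] == "child"]
summary = (
    children.groupby("class")
            .agg(total=("Survived", "count"), survived=("Survived", "sum"))
            .reset_index()
)

# Survival rate in percent, shown alongside the counts it was computed from.
summary["probability(%)"] = (summary["survived"] / summary["total"] * 100).round(1)
print(summary)
```

In the notebook itself, the frames returned by `prob2` already carry `total` and `survived`, so printing them before plotting gives the same context.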
To examine all three factors together, we'll split the data into six groups, one for each combination of class and gender, and analyze the survival rates in each.
d1_male = d_1[d_1.Sex == 'male']
d1_female = d_1[d_1.Sex == 'female']
d2_male = d_2[d_2.Sex == 'male']
d2_female = d_2[d_2.Sex == 'female']
d3_male = d_3[d_3.Sex == 'male']
d3_female = d_3[d_3.Sex == 'female']
def prob3(df1, df2, df3, df4, df5, df6):
    # Count passengers and survivors per age category in each class/gender group
    prob_1 = df1.groupby('category').agg(total=('Survived','count'),survived=('Survived','sum')).reset_index()
    prob_2 = df2.groupby('category').agg(total=('Survived','count'),survived=('Survived','sum')).reset_index()
    prob_3 = df3.groupby('category').agg(total=('Survived','count'),survived=('Survived','sum')).reset_index()
    prob_4 = df4.groupby('category').agg(total=('Survived','count'),survived=('Survived','sum')).reset_index()
    prob_5 = df5.groupby('category').agg(total=('Survived','count'),survived=('Survived','sum')).reset_index()
    prob_6 = df6.groupby('category').agg(total=('Survived','count'),survived=('Survived','sum')).reset_index()
    # Survival rate as a percentage, rounded to one decimal place
    prob_1['probability(%)'] = round((prob_1['survived']/prob_1['total'] * 100),1)
    prob_2['probability(%)'] = round((prob_2['survived']/prob_2['total'] * 100),1)
    prob_3['probability(%)'] = round((prob_3['survived']/prob_3['total'] * 100),1)
    prob_4['probability(%)'] = round((prob_4['survived']/prob_4['total'] * 100),1)
    prob_5['probability(%)'] = round((prob_5['survived']/prob_5['total'] * 100),1)
    prob_6['probability(%)'] = round((prob_6['survived']/prob_6['total'] * 100),1)
    return prob_1,prob_2,prob_3,prob_4,prob_5,prob_6
pro_1,pro_2,pro_3,pro_4,pro_5,pro_6 = prob3(d1_male,d1_female,d2_male,d2_female,d3_male,d3_female)
# Create subplots
fig = make_subplots(
rows=3, cols=2,
subplot_titles = ('Male','Female'),
horizontal_spacing=0.03, # Adjust space between plots
vertical_spacing=0.03,
)
bar_m1 = go.Bar(
x=pro_1['category'],
y=pro_1['probability(%)'],
marker_color='royalblue',width=0.3
)
bar_f1 = go.Bar(
x=pro_2['category'],
y=pro_2['probability(%)'],
marker_color='#EF553B',width =0.3
)
bar_m2 = go.Bar(
x=pro_3['category'],
y=pro_3['probability(%)'],
marker_color='mediumseagreen',width =0.3
)
bar_f2 = go.Bar(
x=pro_4['category'],
y=pro_4['probability(%)'],
marker_color='royalblue',width =0.3
)
bar_m3 = go.Bar(
x=pro_5['category'],
y=pro_5['probability(%)'],
marker_color='#EF553B',width =0.3
)
bar_f3 = go.Bar(
x=pro_6['category'],
y=pro_6['probability(%)'],
marker_color='mediumseagreen',width =0.3
)
# Add bar plots to subplots
fig.add_trace(bar_m1, row=1, col=1)
fig.add_trace(bar_f1, row=1, col=2)
fig.add_trace(bar_m2, row=2, col=1)
fig.add_trace(bar_f2, row=2, col=2)
fig.add_trace(bar_m3, row=3, col=1)
fig.add_trace(bar_f3, row=3, col=2)
# Update layout
fig.update_layout(
title_text='Probability Distribution',
yaxis_title='Class 1',
yaxis3_title='Class 2',
yaxis5_title='Class 3',
height=1000,
width=900,
margin=dict(l=50, r=50, t=100, b=50),
showlegend=False
)
# Use the same 0-100 range on all six panels so bar heights are comparable
fig.update_yaxes(range=[0, 100])
# Show plot
fig.show()
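The six-way split above can also be expressed as a single grouped aggregation, which avoids building six separate frames by hand. This is a minimal sketch of that alternative, not the approach used in this notebook; it assumes the same column names as `d`, and the rows and category labels are made-up toy values rather than the Titanic data.

```python
import pandas as pd

# Hypothetical toy rows, NOT the Titanic data; column names mirror the notebook's `d`.
toy = pd.DataFrame({
    "class":    [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex":      ["male", "male", "female", "female", "male", "male", "female", "female"],
    "category": ["adult"] * 8,
    "Survived": [1, 0, 1, 1, 0, 0, 1, 0],
})

# Group by all three factors at once; the mean of a 0/1 column is the survival rate.
rates = (
    toy.groupby(["class", "Sex", "category"])["Survived"]
       .mean()
       .mul(100)
       .round(1)
       .unstack("category")  # one column per age category
)
print(rates)
```

Each row of `rates` corresponds to one class/gender combination, so the result lines up with the six panels of the figure.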
Gender and Survival: Women had a significantly higher survival rate (74.2%) compared to men (18.9%), reflecting the "Women and Children First" evacuation policy.
Class and Survival: Survival chances increased with higher ticket classes, with first-class passengers having the best survival rates, particularly among women.
Age and Survival: Children had the highest survival rates in every class except first (where teenagers ranked highest), with minimal gender disparity (57% for boys, 59% for girls). However, survival rates fell sharply in the lower classes, especially third class. Among the elderly, women had a 94% survival rate, while men had only 13%. Overall, survival rates were highest in younger age groups and higher classes.
Interaction of Factors: Gender, age, and class together played a crucial role in determining survival, with women and children in higher classes having the best chances of survival.
Exceptions: While children generally had high survival rates, third-class children had notably lower chances, highlighting that class remained a significant factor even among the prioritized groups.